Topic Modelling using LDA(Latent Dirichlet Allocation)¶

LDA starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents. Reference: https://medium.com/intuitionmachine/the-two-paths-from-natural-language-processing-to-artificial-intelligence-d5384ddbfc18

I'll start by reading the data in from the CSV that was previously processed by another notebook.

In [1]:
import pandas as pd
import os

abstracts_df = pd.read_csv(os.path.join('data', 'processed', 'abstracts.csv'))
# https://www.nsf.gov/awardsearch/showAward?AWD_ID=2053734&HistoricalAwards=false
abstracts_df.dropna(subset=['award_id', 'abstract'], inplace=True)

In NLP tasks, usually, we need to normalize the texts before processing them. My normalization process consist in: converting the text in string, convert the text to lowercase, exclude the punctuation, apply a lemmatizer, and finally remove the words with less than 3 letters.

In [2]:
from nltk.corpus import stopwords  #stopwords
from nltk.stem import WordNetLemmatizer  # lemmatizer from WordNet
#from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
#stop-words
stop_words=set(nltk.corpus.stopwords.words('english'))


def normalize_text(text):
    # Convert to text
    normalized_text = str(text)
    # Convert to lowercase
    normalized_text = normalized_text.lower()
    # Convert the words in tokens separated by spaces and transform each word in its lemma
    # -- e.g., "criteria" -> "criterion"
    lemmatizer = WordNetLemmatizer()
    # "geocoordinates" -> "geocoordin"
    # stemmer = PorterStemmer() I avoid this because some words are cut
    word_tokens=word_tokenize(normalized_text)
    tokens=[lemmatizer.lemmatize(w) for w in word_tokens if w not in stop_words and len(w)>3]
    normalized_text=" ".join(tokens)
    return normalized_text
    
text = "National efforts to digitize natural history collections have transformed previously siloed, unstandardized resources into a networked, openly available information nexus usable to meet grand scientific and societal challenges. Despite these enormous strides, major bottlenecks in this digitization process still exist, especially in areas where automation approaches have been most challenging. In particular, capturing analog specimen data into digital format and converting text descriptions of collecting locations into mappable geocoordinates, have remained boutique efforts. Because of these bottlenecks, as many as 91% of digitized specimens are missing key elements that hamper ability to use these specimen records more effectively. This project will develop key workflows to dramatically  increase the speed at which specimen data can be captured and made available broadly to data providers and consumers.  These workflows include novel approaches that use both computer and human intelligence to advance our ability to capture specimen information.  One key workflow focuses on the challenge of automated conversion of imaged specimen labels into properly formatted and usable digital text.  Critical to the success of this workflow are human validation checkpoints that will be implemented using a popular citizen science platform, Notes from Nature.  A second workflow focuses on new tools that take advantage of previous efforts to assign mappable coordinates based on specimen collection location to automatically add such mapping information for specimens missing those data.  Finally, this effort will create tools for easy access to these new data in and out of common use databases, making the data immediately available for museum providers and researchers alike. This effort will connect public participation in science to these novel tools and technologies. Further, it will train diverse graduate students and undergraduate students in bioinformatics and museum science.<br/><br/>This effort has three design goals that together will dramatically reduce the digitization gap in museum specimen data. The first design goal will combine machine learning methods with public participation in scientific research (PPSR) via the successful Notes from Nature (NfN) project to speed up label digitization and facilitate obtaining locality data. A key part of the first design goal utilizes supervised machine learning approaches and object character recognition (OCR) when possible but also includes “humans in the loop” using the NfN platform to gather fast quality feedback from human volunteers at key points. This approach also provides a means to create high-quality training datasets needed for improving automation steps, ultimately further reducing human effort. The second design goal will integrate locality data interpretation through GEOLocate with a Biodiversity Enhanced Locality Service (BELS), which will make it possible to look up pre-existing localities that have been georeferenced using best practices. A third goal is to connect these workflows and services to Symbiota, a community digitization hub, to allow easy inflow and outflow of content back to digitization networks. Providers will be able to easily access new data along with associated metadata about processing steps, all returned using established standards and best practices. The key to this effort will be engagement with the community, including researchers, collections staff, and Zooniverse volunteers. Engagement will focus on virtual training and working with an advisory committee in order to grow capacity and community involvement.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
normalized_text = normalize_text(text)

print(f'Original text: {text}')
print('-------------------------------------------------------------------------------------------')
print(f'Normalized text: {normalized_text}')

normalized_abstracts = abstracts_df['abstract'].apply(normalize_text)
[nltk_data] Downloading package punkt to /home/juan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/juan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/juan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/juan/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Original text: National efforts to digitize natural history collections have transformed previously siloed, unstandardized resources into a networked, openly available information nexus usable to meet grand scientific and societal challenges. Despite these enormous strides, major bottlenecks in this digitization process still exist, especially in areas where automation approaches have been most challenging. In particular, capturing analog specimen data into digital format and converting text descriptions of collecting locations into mappable geocoordinates, have remained boutique efforts. Because of these bottlenecks, as many as 91% of digitized specimens are missing key elements that hamper ability to use these specimen records more effectively. This project will develop key workflows to dramatically  increase the speed at which specimen data can be captured and made available broadly to data providers and consumers.  These workflows include novel approaches that use both computer and human intelligence to advance our ability to capture specimen information.  One key workflow focuses on the challenge of automated conversion of imaged specimen labels into properly formatted and usable digital text.  Critical to the success of this workflow are human validation checkpoints that will be implemented using a popular citizen science platform, Notes from Nature.  A second workflow focuses on new tools that take advantage of previous efforts to assign mappable coordinates based on specimen collection location to automatically add such mapping information for specimens missing those data.  Finally, this effort will create tools for easy access to these new data in and out of common use databases, making the data immediately available for museum providers and researchers alike. This effort will connect public participation in science to these novel tools and technologies. Further, it will train diverse graduate students and undergraduate students in bioinformatics and museum science.<br/><br/>This effort has three design goals that together will dramatically reduce the digitization gap in museum specimen data. The first design goal will combine machine learning methods with public participation in scientific research (PPSR) via the successful Notes from Nature (NfN) project to speed up label digitization and facilitate obtaining locality data. A key part of the first design goal utilizes supervised machine learning approaches and object character recognition (OCR) when possible but also includes “humans in the loop” using the NfN platform to gather fast quality feedback from human volunteers at key points. This approach also provides a means to create high-quality training datasets needed for improving automation steps, ultimately further reducing human effort. The second design goal will integrate locality data interpretation through GEOLocate with a Biodiversity Enhanced Locality Service (BELS), which will make it possible to look up pre-existing localities that have been georeferenced using best practices. A third goal is to connect these workflows and services to Symbiota, a community digitization hub, to allow easy inflow and outflow of content back to digitization networks. Providers will be able to easily access new data along with associated metadata about processing steps, all returned using established standards and best practices. The key to this effort will be engagement with the community, including researchers, collections staff, and Zooniverse volunteers. Engagement will focus on virtual training and working with an advisory committee in order to grow capacity and community involvement.<br/><br/>This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
-------------------------------------------------------------------------------------------
Normalized text: national effort digitize natural history collection transformed previously siloed unstandardized resource networked openly available information nexus usable meet grand scientific societal challenge despite enormous stride major bottleneck digitization process still exist especially area automation approach challenging particular capturing analog specimen data digital format converting text description collecting location mappable geocoordinates remained boutique effort bottleneck many digitized specimen missing element hamper ability specimen record effectively project develop workflow dramatically increase speed specimen data captured made available broadly data provider consumer workflow include novel approach computer human intelligence advance ability capture specimen information workflow focus challenge automated conversion imaged specimen label properly formatted usable digital text critical success workflow human validation checkpoint implemented using popular citizen science platform note nature second workflow focus tool take advantage previous effort assign mappable coordinate based specimen collection location automatically mapping information specimen missing data finally effort create tool easy access data common database making data immediately available museum provider researcher alike effort connect public participation science novel tool technology train diverse graduate student undergraduate student bioinformatics museum science. effort three design goal together dramatically reduce digitization museum specimen data first design goal combine machine learning method public participation scientific research ppsr successful note nature project speed label digitization facilitate obtaining locality data part first design goal utilizes supervised machine learning approach object character recognition possible also includes human loop using platform gather fast quality feedback human volunteer point approach also provides mean create high-quality training datasets needed improving automation step ultimately reducing human effort second design goal integrate locality data interpretation geolocate biodiversity enhanced locality service bel make possible look pre-existing locality georeferenced using best practice third goal connect workflow service symbiota community digitization allow easy inflow outflow content back digitization network provider able easily access data along associated metadata processing step returned using established standard best practice effort engagement community including researcher collection staff zooniverse volunteer engagement focus virtual training working advisory committee order grow capacity community involvement. award reflects statutory mission deemed worthy support evaluation using foundation intellectual merit broader impact review criterion

The input for LDA is a bag of words where each document is a row and each column has the count of words in the corpus.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(normalized_abstracts)
# (num_abstracts, num_words)
print(vectorized_text.shape)
(13159, 43923)
In [4]:
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 10
lda_model=LatentDirichletAllocation(
    n_components=num_topics,
    learning_method='online',
    random_state=92
)

lda_topics = lda_model.fit_transform(vectorized_text)
print(lda_topics.shape)  # (num_abstracts, num_topics)
(13159, 10)

After calculating the topics, we can get the top 10 words associated with each topic

In [5]:
# Most important words for each topic
vocabulary = vectorizer.get_feature_names_out()
n_top_words = 100
topic_word_freq = {}

for index, component in enumerate(lda_model.components_):
    vocab_comp = zip(vocabulary, component)
    sorted_words = sorted(vocab_comp, key= lambda x:x[1], reverse=True)[:n_top_words]
    import_words = [x[0] for x in sorted_words]
    topic_word_freq[index] = import_words[:100]
    print(f"Topic {index}: {', '.join(import_words[:10])}")
    print("\n")
Topic 0: covid, 19, project, impact, technology, health, broader, disease, using, virus


Topic 1: quantum, theory, project, problem, mathematical, study, research, using, equation, award


Topic 2: model, project, ocean, earth, climate, using, process, impact, temperature, change


Topic 3: data, project, system, model, learning, research, network, using, impact, design


Topic 4: student, research, project, stem, support, program, science, education, learning, university


Topic 5: research, project, data, community, change, impact, social, using, study, support


Topic 6: material, research, project, high, property, using, structure, impact, energy, award


Topic 7: cell, protein, project, plant, research, gene, biology, biological, function, using


Topic 8: water, carbon, chemical, soil, chemistry, project, organic, energy, process, reaction


Topic 9: wave, physic, star, award, using, particle, plasma, galaxy, energy, matter


Sometimes, wordclouds are better to identify the topics.

In [6]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(tup):
    wordcloud = WordCloud(background_color='white',
                          max_words=50, max_font_size=40,
                          random_state=42
                         ).generate(str(tup))
    return wordcloud


fig,axes = plt.subplots(5, 2, figsize=(15, 25))

for i in range(5):
    for j in range(2):
        ax = axes[i, j]
        ax.imshow(generate_wordcloud(topic_word_freq[5*j + i]), interpolation="bilinear")
        ax.axis('off')
        ax.set_title(f"Topic {5*j + i}", fontsize=30)

We'll reduce the dimensions of the topics to see them in a 2D graph.

In [7]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1)
# reduce dimension to 2 using tsne
tsne_lda = tsne_model.fit_transform(lda_topics)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 13159 samples in 0.009s...
[t-SNE] Computed neighbors for 13159 samples in 0.615s...
[t-SNE] Computed conditional probabilities for sample 1000 / 13159
[t-SNE] Computed conditional probabilities for sample 2000 / 13159
[t-SNE] Computed conditional probabilities for sample 3000 / 13159
[t-SNE] Computed conditional probabilities for sample 4000 / 13159
[t-SNE] Computed conditional probabilities for sample 5000 / 13159
[t-SNE] Computed conditional probabilities for sample 6000 / 13159
[t-SNE] Computed conditional probabilities for sample 7000 / 13159
[t-SNE] Computed conditional probabilities for sample 8000 / 13159
[t-SNE] Computed conditional probabilities for sample 9000 / 13159
[t-SNE] Computed conditional probabilities for sample 10000 / 13159
[t-SNE] Computed conditional probabilities for sample 11000 / 13159
[t-SNE] Computed conditional probabilities for sample 12000 / 13159
[t-SNE] Computed conditional probabilities for sample 13000 / 13159
[t-SNE] Computed conditional probabilities for sample 13159 / 13159
[t-SNE] Mean sigma: 0.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.629623
[t-SNE] KL divergence after 1000 iterations: 1.161142
In [8]:
import numpy as np

unnormalized = np.matrix(lda_topics)
doc_topic = unnormalized/unnormalized.sum(axis=1)

lda_keys = []
for index in range(abstracts_df.shape[0]):
    lda_keys += [doc_topic[index].argmax()]

lda_df = pd.DataFrame(tsne_lda, columns=['x','y'])
lda_df['abstract'] = abstracts_df['abstract']
lda_df['award_id'] = abstracts_df['award_id']
lda_df['topic'] = lda_keys
lda_df['topic'] = lda_df['topic'].map(int)
lda_df
Out[8]:
x y abstract award_id topic
0 -1.275147 16.423231 National efforts to digitize natural history c... 2027234.0 3
1 33.612213 5.749017 An award is made to the Natural History Museum... 2018207.0 4
2 -6.057844 24.984032 Current software for user authentication relie... 2039373.0 3
3 25.468138 15.642528 This collaborative project comprised of ten aw... 2001394.0 5
4 27.775192 32.858318 Cyberlearning technologies that incorporate ro... 2030441.0 4
... ... ... ... ... ...
13154 7.274016 -32.160351 Recent advances in artificial intelligence (AI... 2008228.0 5
13155 29.610926 -9.462381 Data visualization is a key component to disco... 2006710.0 4
13156 64.865776 -51.218269 The broader impact/commercial potential of thi... 2035899.0 2
13157 -38.417030 -18.448893 Gamma-ray astronomy impacts a broad range of k... 2013109.0 6
13158 40.571548 -8.988461 Controlling cell differentiation is critical w... 2033997.0 4

13159 rows × 5 columns

In [9]:
import bokeh.plotting as bp
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import show, output_notebook

output_notebook()
plot_lda = bp.figure(
    plot_width=700,
    plot_height=600,
    title="LDA topic visualization",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
    x_axis_type=None, y_axis_type=None, min_border=1)


colormap = np.array(["#6d8dca", "#69de53", "#723bca", "#c3e14c", "#c84dc9", "#68af4e", "#6e6cd5",
"#e3be38", "#4e2d7c", "#5fdfa8", "#d34690", "#3f6d31", "#d44427", "#7fcdd8", "#cb4053", "#5e9981",
"#803a62", "#9b9e39", "#c88cca", "#e1c37b", "#34223b", "#bdd8a3", "#6e3326", "#cfbdce", "#d07d3c",
"#52697d", "#194196", "#d27c88", "#36422b", "#b68f79"])


source = ColumnDataSource(data=dict(x=lda_df['x'], y=lda_df['y'],
                                    color=colormap[lda_keys],
                                    abstract=lda_df['abstract'],
                                    topic=lda_df['topic'],
                                    award_id=lda_df['award_id']))

plot_lda.scatter(source=source, x='x', y='y', color='color')
hover = plot_lda.select(dict(type=HoverTool))
hover.tooltips={"abstract":"@abstract",
                "topic":"@topic", "award_id":"@award_id"}
show(plot_lda)
Loading BokehJS ...

I'll use the pyLDAvis library to expand the interpretation of the topics via an interactive tool.

In [10]:
def prepareLDAData():
    data = {
        'vocab': vocabulary,
        'doc_topic_dists': doc_topic,
        'doc_lengths': list(lda_df['len_docs']),
        'term_frequency':vectorizer.vocabulary_,
        'topic_term_dists': lda_model.components_
    } 
    return data

import pyLDAvis

lda_df['len_docs'] = abstracts_df['abstract'].apply(lambda x: len(word_tokenize(x)))
ldadata = prepareLDAData()
pyLDAvis.enable_notebook()
prepared_data = pyLDAvis.prepare(**ldadata)
pyLDAvis.display(prepared_data)
Out[10]:

References¶

  • Mercari Interactive EDA + Topic Modelling
  • Topic Modelling using LDA and LSA in Sklearn)